Wikipedia Pagecounts in Amazon S3

Trends have never been this intelligent

Paul Houle

Creator of database animals and bayesian brains

December 05, 2013

Subjective Importance

For a long time I've been fascinated with the problem of subjective importance, that is, ranking topics by how much people think about them.

Two applications for this are particularly clear: (i) if you're selecting topics using typeahead search, you need some way to put popular topics towards the top and (ii) if you're trying to resolve named entities from documents, you need to know how often concepts occur to accurately estimate how likely it is that a snippet of text corresponds to a specific concept.

Link-based subjective importance

A simple and effective form of subjective importance for DBpedia topics is the number of inlinks to a topic in the DBpedia Pagelinks data set. This works well precisely because wikilinks don't require an ontology. If you count Freebase inlinks instead, things that are well ontologized (musicians and cities) are privileged over less well ontologized topics (policemen and abstract concepts); counting wikilinks avoids that bias.
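A minimal sketch of inlink counting, assuming the Pagelinks data is available as N-Triples lines of the form `<subject> <predicate> <object> .` and that the link predicate is `wikiPageWikiLink` (both are assumptions about the dump, and `count_inlinks` is a name made up for illustration):

```python
from collections import Counter

# Assumed predicate for wikilinks in the Pagelinks dump; adjust if your dump differs.
WIKI_PAGE_LINK = "<http://dbpedia.org/ontology/wikiPageWikiLink>"


def count_inlinks(lines):
    """Count how many page-link triples point at each target URI."""
    inlinks = Counter()
    for line in lines:
        parts = line.split()
        # Expect at least: subject, predicate, object, terminating dot.
        if len(parts) < 4 or parts[1] != WIKI_PAGE_LINK:
            continue
        inlinks[parts[2]] += 1
    return inlinks


# Toy triples, invented for this example.
sample = [
    "<http://dbpedia.org/resource/A> " + WIKI_PAGE_LINK + " <http://dbpedia.org/resource/C> .",
    "<http://dbpedia.org/resource/B> " + WIKI_PAGE_LINK + " <http://dbpedia.org/resource/C> .",
    "<http://dbpedia.org/resource/C> " + WIKI_PAGE_LINK + " <http://dbpedia.org/resource/A> .",
]

for uri, n in count_inlinks(sample).most_common():
    print(n, uri)
```

Because the count ignores the type of the target entirely, a topic with no ontology entry is scored exactly like one with a rich entry.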

An obvious idea is to make a link-based algorithm that is smarter, say by adding PageRank-style recurrence, or by counting links with different weights depending on the predicate.
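To make "PageRank-style recurrence" concrete, here is a small power-iteration sketch over a toy link graph. It is the textbook PageRank formulation rather than anything tuned for DBpedia; the damping factor of 0.85, the iteration count, and the graph itself are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among its outlinks.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: A and B both link to C, and C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
for node, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

Unlike a raw inlink count, A scores well above B here even though A has only one inlink, because that link comes from the highly ranked C.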

The difficulty, however, is that it is impossible to prove that any particular change really improves matters without some kind of "ground truth", and lacking that ground truth, it wasn't worth putting effort into improving the algorithm. I asked a question about this a while ago, but for a long time I had no real way forward.

Direct sampling

Then one day I discovered the Wikipedia pagecounts dataset, which is an hourly record of how many views there have been for everything in every Wikipedia in every language. It even includes views of "Special" pages, user pages, and media items from Wikimedia Commons.

Thus, the probability distribution of people's interest in topics can be directly sampled. This could be used as a "ground truth" for evaluating link-based importance scores but, probably, it will be superior for direct use.
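As a rough sketch of what "directly sampled" means in practice, the snippet below turns pagecounts lines into an empirical probability distribution. It assumes each line has four space-separated fields (project code, page title, view count, bytes transferred), which is my understanding of the hourly files rather than something stated above; the sample lines and their numbers are invented.

```python
from collections import Counter


def parse_pagecounts(lines, project="en"):
    """Sum view counts per page title for one project code."""
    views = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        # Assumed layout: project, title, view count, bytes transferred.
        if len(fields) != 4 or fields[0] != project:
            continue
        try:
            views[fields[1]] += int(fields[2])
        except ValueError:
            continue  # skip malformed counts
    return views


def to_distribution(views):
    """Normalize raw view counts into probabilities that sum to 1."""
    total = sum(views.values())
    if total == 0:
        return {}
    return {title: count / total for title, count in views.items()}


# Invented sample lines in the assumed format.
sample = [
    "en Main_Page 6 1000",
    "en Michael_Phelps 3 500",
    "de Hauptseite 5 800",
    "en Ghostbusters 1 200",
]
print(to_distribution(parse_pagecounts(sample)))
```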

The 4-D challenge

An interesting aspect of this data is that it is extremely time-dependent. I downloaded a few weeks of data around August of 2012 and got the startling result that Michael Phelps was the most important person in the world!

This makes sense, of course, because the Summer Olympics were on and Michael Phelps really may have been the most interesting person at that time, but to get an answer which is "timeless" one needs to average over a long period of time.
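One simple way to average over time, sketched under the assumption that the counts have already been grouped into periods (hours, days, or weeks): compute each topic's share of views within each period, then average the shares. The function name and the toy numbers are made up for illustration.

```python
from collections import defaultdict


def average_share(period_counts):
    """period_counts is a list of dicts mapping title -> views in one period."""
    totals = defaultdict(float)
    used = 0
    for counts in period_counts:
        period_total = sum(counts.values())
        if period_total == 0:
            continue  # skip empty periods
        used += 1
        for title, views in counts.items():
            totals[title] += views / period_total
    if used == 0:
        return {}
    return {title: share / used for title, share in totals.items()}


# Invented numbers: topic A spikes in the first period only.
periods = [
    {"A": 90, "B": 10},
    {"A": 10, "B": 90},
    {"A": 20, "B": 80},
]
print(average_share(periods))
```

Averaging shares rather than raw counts keeps a single high-traffic period from dominating the result, at the cost of giving quiet periods the same weight as busy ones.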

It's clear this data will have biases. For instance, if you look at a recent movie like "The Avengers", you'll see a peak of interest around the time that the movie is in theaters and again when it is released to video. A similar peak of interest happened for "Ghostbusters", but that was in 1984, before Wikipedia existed.

On the other hand, time-based information means that we can track the importance of topics over time, discover what is hot now, and use a time-dependent Bayesian prior in text analysis, an unfair advantage which may finally get named entity resolution across the "commercialization valley of death".
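A toy sketch of a time-dependent prior for entity resolution: the posterior for each candidate concept is proportional to how well the text matches it, multiplied by the concept's share of page views in the period the document was written. The candidate names, likelihoods, and priors are all hypothetical.

```python
def resolve(likelihoods, prior):
    """Combine P(text | concept) with a period-specific prior P(concept)."""
    scores = {c: likelihoods[c] * prior.get(c, 0.0) for c in likelihoods}
    total = sum(scores.values())
    if total == 0:
        return {}
    return {c: s / total for c, s in scores.items()}


# The surface form "Phelps" matches both hypothetical candidates equally well.
likelihoods = {"Michael_Phelps": 0.5, "Other_Phelps": 0.5}

# Hypothetical priors derived from page views in two different periods.
prior_during_olympics = {"Michael_Phelps": 0.9, "Other_Phelps": 0.1}
prior_off_season = {"Michael_Phelps": 0.4, "Other_Phelps": 0.6}

print(resolve(likelihoods, prior_during_olympics))
print(resolve(likelihoods, prior_off_season))
```

With identical text evidence, the preferred reading flips between the two periods, which is exactly the extra information a static prior cannot provide.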

The data set in Amazon S3

All of this adds up to about 3 TB of data stored in Amazon S3. This data can be freely downloaded from the Wikimedia Foundation, but it took me the whole month of November to download it, and you're not likely to do much better because the server only allows a limited number of concurrent connections.

Now that the data is in S3, it can be rapidly processed with Amazon Elastic MapReduce. I'm also about to unbox a 4 TB hard drive and mail it to AWS to get a copy of the data in my office. If you're interested in a copy of this data, I can get it quickly to you in this format.

Get involved

Because this data set is of a preliminary nature, I'm not ready to make it available on a requester-paid basis, as I have for the much smaller :BaseKB data sets. However, if you are interested in working with this data, I can authorize your AWS key for access, or send you a copy on a hard drive. Please contact me if you are interested.